fix: pipeline hangs when submitting from compute nodes #450
cmeesters merged 4 commits into snakemake:main from …
Conversation
When running snakemake from within a SLURM job (e.g., an interactive session on a compute node), the pipeline would submit jobs but never detect their completion, hanging forever.

The `RemoteExecutor` base class starts a status-checking daemon thread in `__init__`, before `__post_init__` is called. The SLURM plugin's `warn_on_jobcontext()` in `__post_init__` would sleep 5 seconds and then delete the SLURM environment variables, but by then the daemon thread had already started and would silently die after its first polling cycle.

Fix: move the SLURM environment detection and cleanup into `__init__`, before `super().__init__()` starts the daemon thread. Remove the now-unnecessary `warn_on_jobcontext()` method and its 5-second sleep.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
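For orientation, here is a minimal sketch of the reordering the fix describes. This is not the PR's verbatim diff: the constructor signature, the base-class import path, and the exact set of variables removed are assumptions.

```python
import os

from snakemake_interface_executor_plugins.executors.remote import RemoteExecutor


class Executor(RemoteExecutor):
    def __init__(self, workflow, logger):
        # Detect a surrounding SLURM job context and scrub it *before*
        # RemoteExecutor.__init__ spawns the status-checking daemon thread,
        # so the thread never sees the outer job's SLURM_* variables.
        if "SLURM_JOB_ID" in os.environ:
            logger.warning(
                "Running within a SLURM job context; removing SLURM_* "
                "environment variables so job status checks work."
            )
            # Snapshot the names first; deleting while iterating os.environ
            # directly would raise.
            for var in [v for v in os.environ if v.startswith("SLURM_")]:
                del os.environ[var]
        # Only now run base initialization (which starts the daemon thread).
        super().__init__(workflow, logger)
```

The ordering is the whole point: any cleanup placed in `__post_init__` runs too late, because the base `__init__` has already launched the polling thread.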
🧹 Nitpick comments (1)
tests/test_cli.py (1)
37-50: Please add a regression test that hits `Executor.__init__()`. These tests still build the object with `Executor.__new__()` and call `__post_init__()` directly, so the moved cleanup path in `Executor.__init__()` is never exercised. Please add one test that instantiates `Executor(...)` with `SLURM_JOB_ID` set and patches `RemoteExecutor.__init__()` to assert the environment is already cleaned before base initialization.
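A hedged sketch of what such a regression test could look like. The `MagicMock` constructor arguments and the base-class import path are placeholder assumptions; a real test would reuse the suite's existing helpers (e.g., `_make_executor`), as the comment above suggests.

```python
import os
from unittest import mock

from snakemake_executor_plugin_slurm import Executor
from snakemake_interface_executor_plugins.executors.remote import RemoteExecutor


def test_init_cleans_slurm_env_before_base_init(monkeypatch):
    monkeypatch.setenv("SLURM_JOB_ID", "12345")
    seen = {}

    def checking_init(self, *args, **kwargs):
        # Record whether Executor.__init__ already scrubbed the environment
        # by the time base-class initialization begins. Delegation to the
        # real RemoteExecutor.__init__ is deliberately skipped here so the
        # sketch stays self-contained (the real one would start the
        # status-checking daemon thread).
        seen["cleaned"] = "SLURM_JOB_ID" not in os.environ

    with mock.patch.object(RemoteExecutor, "__init__", checking_init):
        Executor(mock.MagicMock(), mock.MagicMock())

    assert seen["cleaned"], "SLURM_JOB_ID must be gone before RemoteExecutor.__init__"
```

Recording the result in a dict and asserting after the `with` block keeps the failure message readable even though the check itself runs inside the patched constructor.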
Thanks for this PR! At the Snakemake Hackathon I noticed that even when unsetting all …
@jayhesselberth I am actually fine with this PR. Will you apply …? What I meant by my last remark: if you have an order of commands which solves the start-within-job-context issue, I am eager to learn.
@cmeesters in our case, it was a combination of this fix and not having …
cmeesters left a comment:
@jayhesselberth Ok, I will fix the formatting prior to the next release, but will merge it already.
While this definitely seems to make it more robust when started from a compute node, I had my workflow hanging today for 4.5 h without submitting any jobs before I had to manually cancel, unlock, and restart it: …
@gernophil I'm afraid this is probably a different issue. Please write a comprehensive description as a stand-alone issue report.
@cmeesters Thanks for the feedback. Might be related to workflow profile vs. defining resources via CLI. I'll perform some more systematic tests, talk to our Slurm admins, and then come back with a stand-alone issue once I know more.